A speaker-independent speech recognition system was constructed
which implements a solution to one of the most difficult and most
important problems in speech, that of speaker-to-speaker variability.
The system, which recognizes words in naturally spoken, uncontrolled
text, is based on a theory of speech perception which is consistent
with the linguistic universals of world languages. The representation
is invariant under certain adaptive transformations which render the
speech speaker-independent.

The problem of speaker-to-speaker variability was solved by
reducing the multi-speaker problem to a single-speaker proposition.

A single  speaker  may  train  the  system  to  recognize  a given  vocabulary. 
A subsequent  speaker  need  speak  only  a predetermined  sentence  or 
word  sequence  to  transform  the  system  for  operation  on  his  voice. 

Performance has been evaluated using constraint-free speech,
spoken in natural word sequences. Recognition results for 25
American male speakers are given, indicating an overall recognition
accuracy of 97.6%.

It  is  concluded  that  the  method  of  speaker  transformation  has 
produced  marked  improvement  in  the  recognition  of  connected  speech, 
and  that  the  method  is  applicable  to  a multiplicity  of  speech 
recognition  systems  for  overcoming  speaker-to-speaker  variability 
as well as variations due to vocabulary and language.
nents and optimizations, it is envisioned that this technique will be
invaluable in other areas of Automatic Speech Recognition such as keyword
recognition (word spotting), speaker identification, and language classification.
I. INTRODUCTION


The theory of speech perception utilized in this work falls within a
more general approach to the problem of perception. This approach is
evolutionary in its philosophy and statistical in its methods. The essence
of the approach is as follows: First, consider the physical properties of
the stimulus energy and its statistical distributions in the environment;
second, consider the needs of the organism in terms of individual and
social survival. Given suitable neural material and biochemical processes,
and given enough time for evolutionary forces to assert themselves, we then
postulate that the perceptual devices evolved proceed toward a functional
optimum. When supplemented with additional conditions of a metabolic and
constructional nature, and perhaps some restrictions related to early genetic
fixation, the above statements are assumed to provide a suitable foundation
from which to deduce mathematically the overall properties of a perceptual
device.

As  with  all  evolutionary  processes,  the  perceptual  organization  is  a 
matter  of  compromise  and  balance  between  various  stimuli  in  terms  of  their 
relevance  and  statistical  distribution.  The  statistical  attitude  is  here 
quite  basic  because  perceptual  devices  are  not  designed  for  specific  stimuli. 
Furthermore,  we  are  not  interested  in  specific  designs  or  mechanisms,  but 
rather,  in  the  functional  behavior  of  ensembles  of  devices  under  varying 
distributions of stimuli. We are interested in an optimal functional
representation so as to minimize the dependence of perception on speaker,
vocabulary, or language. In other words, we are interested in a method of
perception and recognition based on the universal characteristics of human
speech.

Our general evolutionary adaptive approach to speech perception is
described in our previous reports (see Reference Section). Recently,
significant advances have been made in this theory, as follows:

1. The need for and the form of a fourth expansion function were
established by detailed perceptual experiments.

2. Studies of variability in rate and manner of speaking have
led to a more efficient sampling-normalization procedure.
The addition of these two advances to our understanding has led to
better control over the variabilities inherent in speech, so that we are now
able to recognize continuous speech consisting of a small vocabulary with
sufficiently high accuracy to render the machine usable in practical
applications.

The system is at present in an experimental stage and has not yet been
brought to real-time operation. This is not a barrier to real-time
operation, since the method can be turned into machine language within a
reasonable length of time.

Many improvements and optimizations are being investigated. We are
planning to incorporate these refinements and optimizations in the future.
II. THE FOURTH FUNCTION


Our previous implementations of the perceptual space have been based
upon three expansion functions. The first of these is a measure of
intensity. The other two make possible a two-dimensional representation
after intensity normalization. These two functions are similar to
$\sin\phi$ and $\cos\phi$. The theory allows for additional functions of a
form similar to $\sin(n\phi)$ and $\cos(n\phi)$ for $n = 2$ and higher. We
have suspected for a long time that at least one additional function would
make possible more accurate distinctions between the continuous speech
sounds (vowels, nasals, continuants), but until last year we were unable to
find a suitable fourth function satisfying perceptual requirements.

It was suggested by Professor Roman Jakobson during a discussion that
linguistic universals emerging from comparative studies of world languages,
and especially the distinctive feature analysis, seem to imply an 8-vowel
cubic representation. The vowel cube so conceived may be considered as
existing inside a spherical perceptual space. This is a generalization of
our previous speech circle into a sphere. The spherical perceptual space
is obtained by utilizing four expansion functions (three plus normalized
intensity). The resulting vowel cube is shown in Figure 1.

In the cubic representation, the eight basic sounds are associated
with the eight vertices of the cube, and pairs appearing on opposite vertices
are complementary. The following complementary pairs have been previously
verified experimentally in connection with our two-dimensional
representation.


The phoneme ö (similar to the vowel in bird) is predicted by the
distinctive feature theory as one of the other two sounds for the vowel
cube, but the identity of the last remaining sound cannot be predicted
uniquely by the distinctive feature theory. We have undertaken to
identify this sound and have performed various tests.

If the cube is projected onto two dimensions perpendicular to a line
through two opposite vertices, the result is a hexagon, as shown in Figure 2.
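
The geometry of this projection can be checked numerically. The sketch below is illustrative only: it assumes a cube with vertices at (±1, ±1, ±1), which is a convenient choice and not a scaling taken from the report. It projects the vertices onto the plane perpendicular to the diagonal through (1, 1, 1) and also reports the height of each vertex along that diagonal.

```python
import math
from itertools import product


def project_cube_along_diagonal():
    """Return (height, (x, y)) for each vertex of the cube [-1, 1]^3.

    height is the coordinate along the body diagonal (1, 1, 1);
    (x, y) are coordinates in the plane perpendicular to that diagonal.
    """
    s2, s3, s6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)
    axis = (1 / s3, 1 / s3, 1 / s3)   # viewing direction (the diagonal)
    e1 = (1 / s2, -1 / s2, 0.0)       # first in-plane basis vector
    e2 = (1 / s6, 1 / s6, -2 / s6)    # second in-plane basis vector

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    return [(dot(v, axis), (dot(v, e1), dot(v, e2)))
            for v in product((-1, 1), repeat=3)]


points = project_cube_along_diagonal()
for height, (x, y) in sorted(points, reverse=True):
    print(f"height {height:+.3f}   projected radius {math.hypot(x, y):.3f}")
```

Two vertices (the ends of the diagonal) project onto the center, and the other six fall at a common radius, forming the hexagon. Three of those six sit at one height and three at the opposite height, which corresponds to the two levels of three sounds discussed in the text.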

If ö is at the top (closest to the reader's eye) then ü, o, and e are
on a higher level than u, a, and i. This feature is used in a computerized
method of identifying the fourth function of the perceptual space. It was
then clear that the missing complementary (namely, the complementary to ö)
must be a sound simultaneously similar to u, a, and i. By repeated
experimentation, a complementary sound to ö was found, and it was verified
by inverse filter listening experiments and by its ability to produce the
theoretically predicted fourth function when used with the other phonemes of
the cube. The sound that passed both of these tests is similar to the nasal
ŋ (as in king), but in sustained form. It completes the set of 4
complementary pairs which are


An ö was spoken into the microphone and its spectrum was obtained on the
FTC spectrum analyzer. The spectral display was carefully traced in pen on
the face of the oscilloscope display. A flat spectrum (narrow pulse) was
next fed to the spectrum analyzer, and filter gains were adjusted until the
spectrum exactly matched the original ö. Listeners verified that,
perceptually, the resulting sound was ö.

Listeners were then asked first to listen to the ö until they became
fully adapted to it, then immediately switch to the flat spectrum. Through
perceptual adaptation the flat spectrum is expected to assume the form of
the complement of ö. Indeed, listeners most often heard nasals: ŋ and
sometimes m or n.

In a similar way, the reverse relationship (that ö is the complement
of m, n) was verified. In almost every case listeners heard ö. In these
contrast experiments, it was found to be of some importance that subjects
were convinced of the identity of the first sound before switching to the
flat spectrum. The resulting sound always seemed to be the complement of
what the subject thought he heard, rather than what was physically presented.

As an independent experimental verification of these concepts,
correlations resulting from a word recognizer were interpreted as cosines of
the angles between vectors representing them in a four-dimensional space.
The recovered distances were then used to construct a three-dimensional
figure, which turned out to be approximately a cube, as predicted. These
results are being implemented in the word recognition program for practical
applications.
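
The step from correlations to distances can be illustrated with a short sketch. It assumes that each sound is represented by a unit-length vector, so that a correlation c between two sounds is the cosine of the angle between them and the straight-line distance is sqrt(2 - 2c). The function name and the sample correlations are hypothetical; the values 1/3, -1/3 and -1 are those of an ideal cube inscribed in a unit sphere, not measured data from the report.

```python
import math


def chord_distance(correlation):
    """Distance between two unit vectors whose cosine is `correlation`."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 - 2.0 * correlation))


# Correlations between vertices of an ideal cube inscribed in a unit sphere.
ideal = [("edge", 1 / 3), ("face diagonal", -1 / 3), ("body diagonal", -1.0)]
for label, c in ideal:
    print(f"{label:14s} correlation {c:+.3f}  distance {chord_distance(c):.3f}")
```

For an ideal cube the three distances stand in the ratio 1 : sqrt(2) : sqrt(3), so measured correlations that reproduce roughly this ratio would be consistent with the approximately cubic figure described above.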

A computer program was written for directly graphing the fourth function
as obtained from one person or a small number of people. The method consists
of summing the spectra of sounds above the center plane of the vowel cube and
summing those below the center plane, then obtaining the difference of the
two. Sounds actually spoken were


Four examples of each phoneme were spoken by each speaker, covering the
range of normal pitches. Figure 3 gives the result as averaged over ten
speakers. The resemblance of this curve to a $\sin 2\phi$ function is quite
apparent.
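
The summing-and-differencing method described above can be sketched in a few lines. This is a minimal illustration and not the original graphing program: the function name is hypothetical, and the four-channel toy spectra are made-up numbers standing in for real filter-bank readings.

```python
def fourth_function(upper_spectra, lower_spectra):
    """Difference between the summed spectra of sounds above and below
    the center plane of the vowel cube.

    Each spectrum is a list of filter-channel readings of equal length.
    """
    n = len(upper_spectra[0])
    upper_sum = [sum(s[k] for s in upper_spectra) for k in range(n)]
    lower_sum = [sum(s[k] for s in lower_spectra) for k in range(n)]
    return [hi - lo for hi, lo in zip(upper_sum, lower_sum)]


# Toy spectra (hypothetical values, four channels each).
upper = [[1.0, 3.0, 1.0, 0.0], [2.0, 2.0, 0.0, 1.0]]
lower = [[0.0, 1.0, 3.0, 1.0], [1.0, 0.0, 2.0, 2.0]]
print(fourth_function(upper, lower))  # [2.0, 4.0, -4.0, -2.0]
```

Averaging over several speakers, as in Figure 3, would amount to applying the same function to each speaker's spectra and averaging the resulting curves channel by channel.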

In the implementation of a four-dimensional representation one could
therefore choose, as primary vowels, u, a, i and ö. If the vocabulary does
not contain ö, one could replace ö with e without sacrificing anything in the
representation of that vocabulary.


III. TIME-NORMALIZATION


Time normalization is part of the more general problem of how to sample
speech for efficient recognition. The general problem of sampling has many
aspects, only one of which is time-normalization.

The basic idea of time normalization is to sample the speech so as to
render it more or less time independent. It is usually achieved in a crude
way by taking the samples only after a significant amount of spectral change
occurs. This, however, is not sufficient because the intensity, rate of
change of intensity, voicing, etc. are part of the recognition criteria. The
time-normalization was therefore improved by adding parameters which
represent the influence of these variables. Intensity and rate of change of
intensity are controlled by two parameters. Voicing is used to label voiced
and unvoiced samples. Additional improvements are made in the fricative and
gap areas to reduce oversampling. In the present sampling procedure we have
utilized correlation, peak normalized intensity, voiced-unvoiced labeling,
and the channel characteristics during silences. The improved sampling
procedure so obtained has been tested and is working fairly satisfactorily.
The computer printouts of recognized words given elsewhere in the report
were produced under this sampling procedure.
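
The crude form of time normalization, in which a sample is taken only after a significant spectral change, might be sketched as follows. This is a simplified illustration under stated assumptions: spectral change is measured by the normalized correlation between the current frame and the last kept frame, the threshold of 0.9 is an arbitrary hypothetical value, and the additional intensity, voicing and silence parameters described above are omitted.

```python
import math


def correlation(a, b):
    """Normalized correlation (cosine) between two spectral frames."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0


def select_samples(frames, threshold=0.9):
    """Return indices of frames kept after time normalization.

    A frame is kept only when its correlation with the most recently
    kept frame falls below `threshold`, i.e. after enough spectral change.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if correlation(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept


# Toy 3-channel frames (hypothetical values): two near-duplicates are skipped.
frames = [[1, 0, 0], [1, 0.1, 0], [0, 1, 0], [0, 1, 0.1], [0, 0, 1]]
print(select_samples(frames))  # [0, 2, 4]
```

Because a steady sound produces highly correlated frames, the number of kept samples depends on how much the spectrum changes and not on how long the sound is held, which is the sense in which the sampling is time independent.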

We point out, however, that even this improved sampling method is not
fully satisfactory, because it is still dependent on how saturated the
spoken words are. In particular, distinctly pronounced or saturated words
result in more samples than words which are not saturated. The main
problem caused by such a mismatch of samples is that they get out of step and
occasional recognition errors occur.

There are various ways of overcoming this problem. One is to subdivide
the word into smaller, phoneme-size components and prevent the matching as
a whole from getting out of step by first processing the components. Another
is to provide word alternatives. A third would be to take the saturation
into account explicitly in the course of sampling. This is equivalent to
introducing another parameter into the sampling procedure. Roughly speaking,
this is analogous to an image sharpening operation. We have devised a
method of renormalizing the syllabic segments to achieve the equivalent of
an image sharpening or contrast enhancement process. This, however, is not
yet implemented. When all of these are done, the procedure is no longer
equivalent to a standard time-normalization or time-warping technique.
Our experience to date seems to show that one of the most crucial components
of a continuous speech recognition system is the sampling procedure. We
believe we have achieved a reasonably good sampling process and expect to
improve it significantly in the future.


IV.  IMPLEMENTATION 


We have implemented a speaker-independent connected speech recognizer
based on a generalization of the techniques developed and tested under
previous contracts. The techniques include the perceptual representation of
vowels and its application to speaker transformations. We have also
developed and implemented a time-normalized sampling procedure and applied it
to connected speech recognition.

1.  Extraction  of  Vowels 

We have built the necessary hardware and devised the necessary software
programs to extract a set of four suitable vowels from a connected utterance
of common words.

The concept of extracting vowels from common words has the following
practical advantages:

a) A speaker need not be trained to say the vowels; instead, only
a set of common words is required from him.

b) A speaker is more likely to give consistent utterances in common
words than in isolated vowels.

The present software requires a prescribed set of common words, which we
chose to be the sequence "one three seven". Upon receipt of this utterance,
which can be spoken in a discrete or connected manner, the system proceeds
to extract vowels.

The vowel "U" is taken at the onset of the voiced portion of 1.

The vowel "A" is taken at the dominant voiced portion of 1.

The vowel "I" is extracted from the region following the dominant
voiced portion of 3.

The vowel "E" is extracted from the dominant voiced portion of 7.

The extracted vowels, each represented by its spectrum, are stored
in the form of filter outputs, which contain 16 numbers for each vowel.
voiced or unvoiced.

c) It distinguishes between voiced and unvoiced signals.

d) It places emphasis on the sequential ordering of samples.

e) It takes into account the rate of time-development of the signal.

3. Representation of Normalized Samples

Each normalized sample is originally obtained as a set of 16 filter
readings. For the purpose of speaker-independent transformation, these
samples are expanded in terms of the four vowels extracted above. The
resulting representation for the sample then consists of 4 coefficients
$\alpha$, $\beta$, $\gamma$, and $\lambda$. Each of the normalized samples is
represented as

$$P = \alpha U + \beta A + \gamma I + \lambda E$$

In order to represent a given sample of 16 filter readings in the above form
the following steps are taken:

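a) The 4 x 4 symmetric matrix containing the correlations, XY, between
any pair of the base functions is calculated:

$$M = \begin{pmatrix}
UU & AU & IU & EU \\
UA & AA & IA & EA \\
UI & AI & II & EI \\
UE & AE & IE & EE
\end{pmatrix}
= \begin{pmatrix}
1 & AU & IU & EU \\
UA & 1 & IA & EA \\
UI & AI & 1 & EI \\
UE & AE & IE & 1
\end{pmatrix}$$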

b) The inverse matrix $M^{-1}$ is calculated:

$$M^{-1} = \begin{pmatrix}
1 & AU & IU & EU \\
UA & 1 & IA & EA \\
UI & AI & 1 & EI \\
UE & AE & IE & 1
\end{pmatrix}^{-1}$$


c) The column matrix of correlations between the normalized sample, P,
and the vowels above is calculated as:

$$\begin{pmatrix} PU \\ PA \\ PI \\ PE \end{pmatrix}$$


d)  The  coefficients  of  expansion  for  the  normalized  sample  can  then  be 
obtained  as 
It can be shown that this representation is equivalent to first constructing
a set of four orthogonal functions and then representing the sample by these
orthogonal functions, as long as the choice of vowels is linearly independent
and perceptually consistent with Section II. For example, U, A, I and aU + bI
cannot be chosen as primary vowels, since the determinant of M would then
vanish and $M^{-1}$ would not exist.
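
The steps above can be sketched in code. The sketch assumes that the correlation XY is the inner product of two spectra and that each vowel spectrum is scaled to unit length, which makes the diagonal of M equal to 1. Under that assumption, correlating both sides of the expansion with each vowel gives a linear system M c = r, where r is the column of correlations between the sample and the vowels, so the coefficients are c = M^{-1} r. All function names and the 5-channel toy spectra are hypothetical; the actual system uses 16 filter readings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a spectrum to unit length so that its self-correlation is 1."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("vowels are linearly dependent; M is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def expansion_coefficients(sample, vowels):
    """Coefficients of `sample` in terms of the unit-normalized vowels."""
    base = [unit(v) for v in vowels]
    m = [[dot(a, b) for b in base] for a in base]   # step a): matrix M
    r = [dot(sample, b) for b in base]              # step c): correlations
    return solve(m, r)                              # steps b) and d)


# Toy 5-channel vowel spectra (hypothetical values).
U = [4.0, 1.0, 0.0, 0.0, 0.0]
A = [1.0, 4.0, 1.0, 0.0, 0.0]
I = [0.0, 0.0, 1.0, 4.0, 1.0]
E = [0.0, 0.0, 0.0, 1.0, 4.0]
vowels = [U, A, I, E]

# Build a sample with known coefficients and recover them.
base = [unit(v) for v in vowels]
true = [0.5, 0.2, 0.2, 0.1]
sample = [sum(c * b[k] for c, b in zip(true, base)) for k in range(5)]
coeffs = expansion_coefficients(sample, vowels)
print([round(c, 6) for c in coeffs])
```

If the fourth vowel is replaced by a combination such as 2U + 3I, the elimination finds no usable pivot and raises an error, which mirrors the requirement that the primary vowels be linearly independent.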

4.  Categorization 

Categorization is aimed at circumventing the problem of speaker
variations in the manner of speaking, accent, or dialect. The differences in
vocal characteristics are removed by the above transformations, but these
transformations do not affect non-phonetic variabilities such as dialectal
and habitual idiosyncrasies. Categorization is a process by which a speaker
can be placed in one of a few categories according to accent, dialect, and
habitual differences. Whether a new speaker falls into one of the chosen
categories is determined by the degree of closeness with which his common
word characteristics match those in the category.

5. Data Bank

For each category described above, we store in our data bank the
expansion coefficients of normalized samples belonging to the vocabulary to
be processed by the recognizer. These are gathered from the speakers
belonging to the same category. In general there is a great deal of
similarity among speakers in the same category. In those cases where large
deviations occur, alternative forms of the same word, resulting from
idiosyncrasies, are stored. The systematic gathering of data is a
time-consuming and tedious
The basic idea here is that an unknown speaker's templates for the
vocabulary words can be simulated by knowing the following:

a) His vowels U, A, I, and E (or ö if it occurs in the vocabulary)

b)  A category  closest  to  his  own  vowel  characteristics 

c)  The  coefficients  of  expansion  belonging  to  the  category. 

The unknown speaker's primary vowels are obtained by requiring him to
say a set of prescribed common words such as "one three seven" into the
system. A better set would be "she too must learn", but time-window
limitations did not permit its use for all speakers consistently. Thus the
common words are judiciously chosen, but they need not be taken from the
vocabulary words as long as they contain the necessary vowels for the
vocabulary. A category is selected for the unknown speaker by comparing his
common word characteristics with those in the existing categories. The
category in which the common word characteristics are most alike is taken to
be that of the unknown speaker.
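
Category selection of this kind might be sketched as a nearest-match search. The report does not state how closeness is measured, so the normalized correlation used here is an assumption, as are the function names, the category labels, and the three-element feature vectors.

```python
import math


def similarity(a, b):
    """Normalized correlation between two feature vectors (assumed measure)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0


def closest_category(speaker_features, categories):
    """Return the name of the stored category most similar to the speaker."""
    return max(categories,
               key=lambda name: similarity(speaker_features, categories[name]))


# Hypothetical common-word characteristics for two stored categories.
categories = {"A": [1.0, 0.2, 0.1], "B": [0.1, 1.0, 0.3]}
new_speaker = [0.9, 0.3, 0.1]
print(closest_category(new_speaker, categories))  # A
```

A practical version would presumably also reject a speaker whose best similarity is too low to fall into any existing category, in line with the remark that membership is determined by the degree of closeness.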

The normalized samples for the unknown speaker are then simulated using
the equation

$$P = \alpha U + \beta A + \gamma I + \lambda E$$

where $\alpha$, $\beta$, $\gamma$ and $\lambda$ are coefficients of expansion
stored in the data bank under the category closest to that of the unknown
speaker. The sequential ordering of these computed samples is strictly
adhered to. In this way the templates of the vocabulary words are created
for the unknown speaker without his specifically training the system. Since
the templates are created by using the speaker's own category functions as
they occur in his own common words, only a small amount of variability
remains between his actual templates and the ones created by the above
process.
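
Template simulation as described above amounts to recombining the new speaker's four vowel spectra with the stored coefficients, one sample at a time and in the stored order. The following is a minimal sketch; the function name and the four-channel toy data are hypothetical.

```python
def simulate_template(coefficients, vowels):
    """Build a word template for a new speaker.

    coefficients: ordered list of 4-tuples of expansion coefficients,
                  one tuple per normalized sample of the stored word.
    vowels:       the new speaker's four vowel spectra (U, A, I, E).
    Returns the simulated samples in the same sequential order.
    """
    channels = len(vowels[0])
    return [[sum(c * v[k] for c, v in zip(coeffs, vowels))
             for k in range(channels)]
            for coeffs in coefficients]


# Toy data (hypothetical): four 4-channel vowel spectra and a 2-sample word.
vowels = ([1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0])
stored = [(0.5, 0.5, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)]
print(simulate_template(stored, vowels))
# [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
```

Only the vowel spectra change from speaker to speaker; the coefficient sequences are shared by everyone in the category, which is why a new speaker needs to supply only the common words.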


The capability of simulating the templates for the unknown speaker
renders a single-speaker recognizer conducive to multi-speaker use. The
recognition algorithm in this system is therefore designed for achieving
high accuracy in the single-speaker case, and its extension to multi-speaker
use is through transformation to categories existing in the data bank.

The utterances of an unknown speaker undergo signal processing as
stated earlier. The resulting normalized samples are compared with those in
the simulated templates. The time sequence of these samples in a given block
and the sequence of blocks in a given word are strictly maintained throughout
the comparisons. The section of an utterance that compares favorably with
certain words (the figure of merit exceeding a prescribed value) is
assumed to be one of these alternative words, but no decision is yet made. We
call this stage the preliminary recognition stage.

8.  Final  Decision 

At the end of the preliminary recognition stage only a few possible
outcomes await final decision. In fact, if the figure of merit is set high
enough, most of the words are already reduced to a single choice; hence they
are already recognized. There are, however, a few remaining cases where
further decisions are to be made to resolve conflicts and ambiguities. These
are of the following types:

a) A short word matching with part of another word and causing
a "phantom", such as 3 → 38, 7 → 71.

b) The number of samples being too large and reducing the score,
due to mismatch, such as 7 → ?

c) A long template "swallowing" a short word due to fast speaking,
such as 38 → 3.

d) The parts joining two words triggering a third one, such as
34 → 0.

These ambiguous cases are resolved by what we call the "final editing"
procedures. For example, 3 → 38 is corrected for the "phantom" 8 by a
program which will not allow an 8 following a 3 unless that 8 is higher in
figure of merit than a preset threshold. This threshold is such that it
allows a "true" 8; hence 38 → 38 is secured whereas 3 → 38 is corrected
into 3 → 3. Similar editing procedures are applied for the other ambiguous
cases. These procedures are explicitly given in the overall recognition
program. Those confusions and ambiguities we could not overcome at the
present time are considered as recognition errors.
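
One such editing rule, the suppression of a phantom 8 after a 3, might look like the sketch below. It is illustrative only: the function name, the 0-to-1 scale for the figure of merit, and the threshold of 0.8 are assumptions, and the actual editing programs are not reproduced in this section.

```python
def edit_phantom_eight(candidates, threshold=0.8):
    """Apply one 'final editing' rule to preliminary recognition output.

    candidates: list of (digit, figure_of_merit) pairs in time order.
    An 8 that immediately follows a 3 is dropped unless its figure of
    merit exceeds `threshold`.
    """
    edited = []
    for digit, merit in candidates:
        follows_three = len(edited) > 0 and edited[-1][0] == 3
        if digit == 8 and follows_three and merit <= threshold:
            continue  # treat as a phantom and discard it
        edited.append((digit, merit))
    return [digit for digit, _ in edited]


print(edit_phantom_eight([(3, 0.90), (8, 0.60)]))  # [3]
print(edit_phantom_eight([(3, 0.90), (8, 0.95)]))  # [3, 8]
```

The first call corresponds to a spoken 3 that produced a weak phantom 8, and the second to a genuinely spoken 38 whose 8 scores high enough to be kept.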

Figure  4 shows  a schematic  diagram  of  the  system  configuration. 

9.  System  Hardware  Description 

A block diagram of the system is shown in Figure 5. It consists of a
Digital Equipment Corporation PDP-8E computer with twenty-eight thousand
words of memory and an Extended Arithmetic Element Type KE8-E, a 512K
fixed-head disk, a DECtape drive, a Tektronix CRT graphics terminal, speech
processing circuits, and the interface between them and the computer. The
system is activated at the beginning of an utterance, processes the string
of words, and prints out their identity after the end of the utterance.

The signal from the microphone is amplified, high-frequency
pre-emphasized, and passed through a 24 dB/octave bandpass filter with cut-off
frequencies at 250 Hz and 5300 Hz. From the resulting signal, two types of
information are extracted: the spectral distribution and auxiliary features.

The spectral distribution of the speech signal is determined by passing
it through a bank of 16 bandpass filters, the outputs of which are
rectified, smoothed, sampled every 10 msec, and stored in the computer. Linear
combinations of these 16 channels are calculated and tabulated in the
computer to form the data points in the four-dimensional, frequency-domain
representation.

A specially designed circuit involving audio compression and
zero-crossing information distinguishes between noise and an utterance. It
provides a binary waveform which is sampled every 10 msec, stored, and,
under program control, used to determine the voicing state and the end of
the utterance.


V.  RESULTS 


The results are presented in four sections. Each section describes
evaluation tests conducted to test a specific stage in the development of
the speech recognition system. The first two sections cover preliminary
evaluation of a partially completed system; the last two sections show the
performance of the final version of the recognition system.

Recognition Results for a Selected "Hard-Set" of Digits

In order to test the approach under a severe condition, a set of
difficult connected digits was established. The strings of digits were
selected based on past experience with other methods of recognition. The
strings were chosen because they presented problems in previous recognition
schemes due to coarticulation and stress.

The  results  for  10  speakers  are  shown  below: 


Speaker    # of Digits    % Correct

B.P.           48            95.8
R.V.           42            95.3
A.K.           75            97.4
L.F.           53           100.0
W.B.           75            98.7
H.K.           54           100.0
N.J.           69            98.6
W.S.           75            97.4
R.W.           39            95.0
F.D.           30            96.7

Total         561

The results show an overall accuracy of 97.4% for individual digits and
93% correct sequences for the "hard set". There were 187 sets and 13 errors.
The preliminary demonstration was based on twelve sets of random digits
recorded by twelve speakers, two of whom were from RADC. The recordings
consisted of twenty sets of three-digit strings. Of the 720 digits, 1 was
an omission error and 4 were extraneous errors (phantoms). The overall
accuracy of recognition was 96.1%. The extraneous errors as well as the
omission errors are due to the fact that the number of digits in a string
is not known. The program scans through the data, and any number of digits
is likely to come up.

Close  examination  of  these  results  indicated  that  the  phonetic  editing 
programs  were  deficient.  The  phonetic  editing  programs  were  rewritten  and 
a set  of  editing  rules  was  implemented  before  the  final  evaluation. 

Final  Demonstration 

The final demonstration consisted of a technical session and a live
demonstration of recognition of English digits in connected strings. Each
of seven speakers read a random list of digits in groups of four, three, two,
or one digits per string. Most of the total of 282 digits were in groups of
three digits per string (70%). The other 30% consisted mostly of single- or
double-digit strings.

The  demonstration  was  conducted  live  so  that  when  an  error  occurred  the 
speaker  repeated  the  same  string  again.  Using  this  procedure  it  was  possible 
to  test  whether  the  system  can  be  used  as  a practical  data  entry  system. 

The  overall  recognition  accuracy  for  the  seven  speakers  was  97.5%  per 
digit.  After  a single  repetition  of  each  of  the  error-strings  the  accuracy 
was  99.3%. 

Final  Evaluation 

In order to obtain a higher level of confidence in the results obtained
during the final demonstration, the system was retested for twenty-five (25)
male speakers. Each speaker recorded a list of random digits. There were
two sets of recordings. One set was recorded at Perception Technology
Corporation and contained sixteen (16) speakers. The PTC recording consisted
of 150 digits per speaker: 20 strings of triple digits, 25 strings of double
digits, and 40 single digits recorded in random fashion. The RADC recordings
were recorded in the same manner for nine (9) speakers except for five
additional strings of three digits, for a total of 175 digits per person.

The results indicate that only three of the twenty-five speakers were
below 95%. The average recognition score including all sources of errors
and rejections was 97.6% per digit.
VI. BACKGROUND
Perception Technology Corporation has been working for the last
several years toward the solution of the problem of speech perception and
recognition under various conditions of channel distortion and speaker
variability. Our past work in the area of speech perception could be described
as an effort toward invariant extraction of relevant parameters of human
speech under various conditions of variability. These variabilities are
partly external, such as channel noise and distortion, and partly internal,
such as intra- and inter-speaker variability, interphonemic interaction,
accent, and dialect. In its most general form the problem is formidable,
especially since the human perceptual process is not completely known or
understood.

Perception Technology Corporation pursued the solution of this problem
through what we believe to be an effective combination of theoretical and
experimental research into speech perception. What we have done is
essentially the application of the "scientific method" so well known to be
operative in the positive sciences, namely, to first formulate a plausible
theory of the phenomenon based on what is already known, and then test this
theory by the new experiments it suggests, to find out more and to determine
its limitations. As new facts are uncovered one is then in a position to
improve, modify, or alter the original theory until all the facts,
previously known and newly uncovered, may be summarized by the improved
theory.

Perception Technology Corporation went through this process, and by so
doing produced what we believe is a theoretically coherent and experimentally
viable theory of speech perception at the phonetic and phonemic level.

Three  main  conclusions  of  the  theory  and  the  supporting  experimental 
data  are: 

a) That there exists a multidimensional perceptual space, independent
of language or of speaker, in which speech sounds can be
represented.

b) That this space is not in one-to-one correspondence with the
physical signal-space but is defined up to some adaptive
transformations.

c) That at least at the phonemic level speech sounds tend to be
categorized sufficiently to warrant the application of statistical methods.

We have shown that this work is directly applicable to the objectives
of implementing an operational continuous speech recognition system. The
perceptual space delineates the extent to which the physical signal is to
be expanded into linearly independent base functions. The present data
show that the number of such independent functions need not be more than
4-5 for speech intelligibility. For full naturalness and speaker identity
the number is larger but probably not larger than 10-15. The simplest case
of 3 independent functions was extensively studied. The case of 4
independent functions is found to be necessary for higher accuracy in
intelligibility and recognition. The adaptive transformations imply the
invariance of perceptually relevant parameters under conditions of
variability such as channel distortion and speaker-to-speaker variations.
The adaptive transformations have the dimensionality of the space itself;
namely, if 4 independent functions are chosen as adequate for a given purpose
then the transformations are 4x4 matrices. The intra-speaker transformations
may be viewed as a special case of the inter-speaker transformations. The
interphonemic transformations are more complex in nature, although in some
sense they may be regarded as the short-time (context-dependent) limit of the
intra-speaker transformations. Finally, the categorical perception of speech
sounds implies a strong tendency toward discreteness of perceptual response
to speech sounds, especially to consonants.
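
The role of the 4x4 matrices can be illustrated with a minimal sketch. It
is not the procedure used in the present system; it only assumes that each
speech sound is represented by 4 expansion coefficients, and that a new
speaker's coefficients for a known word sequence can be paired with the
reference speaker's coefficients for the same sounds. The function names
(estimate_transform and its helpers) and all numbers are hypothetical. Under
those assumptions a transformation matrix can be fitted by least squares:

```python
# Hypothetical sketch: fit a 4x4 speaker transformation from paired
# coefficient vectors. Names and data are illustrative assumptions only.

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: coefficients may be dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def estimate_transform(new_vecs, ref_vecs):
    """Least-squares T such that new_vecs @ T approximates ref_vecs.

    Each row is one sound's 4 coefficients; T solves the normal
    equations (N^T N) T = N^T R.
    """
    nt = transpose(new_vecs)
    return solve(mat_mul(nt, new_vecs), mat_mul(nt, ref_vecs))

# Made-up coefficients of six sounds for the new speaker.
new_vecs = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
    [0, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 1],
]
# A made-up "true" transformation used only to generate reference data.
t_true = [
    [1.1, 0.2, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.05, 0.0, 0.0, 0.8],
]
ref_vecs = mat_mul(new_vecs, t_true)
t_est = estimate_transform(new_vecs, ref_vecs)
```

With noise-free paired data the fitted matrix t_est reproduces t_true. With
real speech the fit would only be approximate, and at least 4 linearly
independent coefficient vectors are needed for the system to be solvable.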


VII.  CONCLUSIONS 


The present system demonstrates that speaker transformations reducing
the variabilities due to vocal characteristics can be achieved by using
perceptual expansion functions and their transformations from speaker to
speaker. The key element is the choice of a linearly independent and
perceptually significant set of functions. If the number of functions is too
small the representation is not accurate enough. If it is larger than
perceptually required for speech intelligibility then the representation is
too detailed, which makes the categories very large. Furthermore, too many
functions often tend not to be linearly independent, which makes the matrix
inversion meaningless. Four linearly independent functions seem to be both
perceptually required and computationally trouble-free. Accuracies achieved,
although somewhat below human performance, are high enough to warrant further
elaboration of the system to produce a practical, highly accurate continuous
speech recognizer for a small vocabulary.
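
The remark on linear independence can be made concrete with a small
numerical sketch. The cosine functions below are stand-ins chosen for
convenience and are not the expansion functions of the present system; the
helper names are likewise hypothetical. The determinant of the Gram matrix
of unit-normalized functions equals 1 for an orthogonal set and approaches
0 as the set becomes linearly dependent, at which point inverting the
matrix is no longer numerically meaningful:

```python
# Hypothetical sketch: how a nearly dependent extra function degrades
# the Gram matrix. The cosine basis is an illustrative assumption.
import math

def gram(funcs):
    """Gram matrix of unit-normalized sampled functions."""
    unit = []
    for f in funcs:
        norm = math.sqrt(sum(v * v for v in f))
        unit.append([v / norm for v in f])
    return [[sum(a * b for a, b in zip(f, g)) for g in unit] for f in unit]

def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return det

# Sample points at interval midpoints, where these cosines are orthogonal.
num_samples = 64
xs = [(i + 0.5) / num_samples for i in range(num_samples)]

# Four mutually orthogonal functions.
four = [[math.cos(k * math.pi * x) for x in xs] for k in range(4)]

# A fifth function that is almost a combination of two of the others.
extra = [a + b + 1e-3 * math.cos(4 * math.pi * x)
         for a, b, x in zip(four[1], four[2], xs)]

det_four = determinant(gram(four))            # close to 1
det_five = determinant(gram(four + [extra]))  # close to 0
```

Here the four-function set gives a determinant near 1, while adding the
nearly redundant fifth function drives it toward 0, so small measurement
errors would be greatly amplified by the inversion.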

Many improvements, optimizations and refinements that we could incorporate
were not implemented because of program limitations. We are planning
to enlarge the time window so that a four-word preamble such as "She too
must learn" can be used consistently. This would also allow the entry of
four- or five-digit strings for recognition. The parameters introduced for
recognition purposes are not fully optimized. They were set to certain
values after a limited number of trials. To optimize these parameters fully
we must first bring the machine to real-time operation. Real-time
operation is also essential for gathering statistics and, of course, for
eventual use of the machine for practical purposes.

The conclusion seems to be that the present method is applicable to a
multiplicity of speech recognition systems and reduces the variabilities due
to speaker, vocabulary or language as far as the vocalic (acoustic-phonetic)
aspects are concerned. Alternative pronunciations due to accent and dialect
are not covered by the transformations when the variants are sufficiently
different.

Figure 9: "...ONE TWO THREE FOUR..." (First Section)


Figure  9a:  "...ONE  TWO  THREE  FOUR..."  (Second  Section) 


